Project Overview

Current project tree

.
├── LICENSE
├── README.md
├── cicd.png
├── config
│   ├── config.yaml
│   ├── samples.tsv
│   └── units.tsv
├── dags
│   ├── rulegraph.png
│   └── rulegraph.svg
├── data
│   ├── metadata
│   ├── metadatarun_accessions.txt
│   ├── reads
│   └── temp
├── images
│   ├── bkgd.png
│   ├── bkgd1.png
│   ├── bkgd2.png
│   ├── cicd.png
│   ├── imap_all_parts.png
│   ├── metadata.png
│   ├── project_tree.txt
│   ├── smkreport
│   └── sra_run_selector.png
├── imap-metadata-profiling.Rproj
├── imap-template.html
├── index.Rmd
├── library
│   ├── apa.csl
│   ├── imap.bib
│   └── references.bib
├── refer
│   ├── bioinfo_pipelines.Rmd
│   ├── fetch_sra_metadata.py
│   ├── mapping_files.Rmd
│   ├── mothurError.sh
│   ├── mothurMock.sh
│   ├── mothurReferences.sh
│   ├── mothurShared.sh
│   ├── mothurSplitShared.sh
│   ├── mothur_design_file.R
│   ├── mothur_mapping_file.R
│   ├── preprocess_tools.Rmd
│   ├── project_overview.Rmd
│   ├── read_csv.py
│   ├── seqkit_stat_1.sh
│   ├── sequencing_data.Rmd
│   ├── software.Rmd
│   └── subset_fastq.sh
├── report.html
├── resources
├── results
├── styles.css
└── workflow
    ├── Snakefile
    ├── envs
    ├── reports
    ├── rules
    ├── schemas
    └── scripts

18 directories, 43 files



Current Snakemake workflow






Sample Metadata overview

What is metadata?

  • Metadata is a set of data that describes and provides information about other data. It is commonly defined as data about data.
  • Sample metadata described in this book refers to the description and context of the individual sample collected for a specific microbiome study.


Metadata structure

  • Metadata collected at different stages (Figure 1) are typically organized in an Excel or Google spreadsheet where:
    • The metadata table columns represent the properties of the samples.
    • Each row of the metadata table contains the information associated with one sample.
    • Typically, the first column of sample metadata is the Sample ID, the key that identifies each individual sample.
    • Each Sample ID must be unique.
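
The uniqueness requirement above can be checked before any downstream step. The following is a minimal sketch, assuming a tab-separated table whose key column is named `sample_id`; the column name, the helper `duplicated_sample_ids`, and the example rows are hypothetical and not taken from this project.

```python
import csv
import io
from collections import Counter


def duplicated_sample_ids(table_text, id_column="sample_id", delimiter="\t"):
    """Return the sample IDs that occur more than once, in first-seen order."""
    reader = csv.DictReader(io.StringIO(table_text), delimiter=delimiter)
    counts = Counter(row[id_column] for row in reader)
    return [sample_id for sample_id, n in counts.items() if n > 1]


# Hypothetical table: the second and third rows share the same ID.
example = "sample_id\thost\nS1\tbat\nS2\trodent\nS2\tprimate\n"
print(duplicated_sample_ids(example))
```

An empty list means every Sample ID is unique; any returned ID should be resolved before the table is used as a mapping file.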


Embedded metadata

  • In most cases, you will find the metadata detached from the experimental data.
  • Embedded metadata is integrated with the experimental data, which is especially useful for graphics.
  • Major microbiome analysis platforms require sample metadata, commonly referred to as a mapping file, when performing downstream analysis.


Explore SRA metadata

  • After sequencing microbiome DNA, investigators are typically encouraged to deposit the sequence reads in a public repository. The Sequence Read Archive (SRA) is a widely used public repository for sequence reads. An advantage of the SRA is that it integrates data from the NCBI, the European Bioinformatics Institute (EBI), and the DNA Data Bank of Japan (DDBJ).
  • This demo uses metadata associated with four microbiome BioProjects:
    • PRJNA477349: 16S rRNA from bushmeat samples collected from Tanzania Metagenome
    • PRJNA685168: Multi-omics suggest diverse mechanisms for response to biologic therapies in IBD
    • PRJEB21612: Alterations of the gut microbiome in hypertension
    • PRJNA802976: Changes to Gut Microbiota following Systemic Antibiotic Administration in Infants

Example screenshot of the SRA Run Selector showing metadata associated with NCBI SRA BioProject PRJNA477349
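
Once the Run Selector tables for several BioProjects are combined, a quick tally of runs per project is a useful sanity check. The sketch below assumes a comma-separated table with `Run` and `BioProject` columns; the function name `runs_per_bioproject` and the run identifiers are hypothetical placeholders, not real accessions.

```python
import csv
import io
from collections import Counter


def runs_per_bioproject(table_text, project_column="BioProject"):
    """Count how many rows (runs) belong to each BioProject."""
    reader = csv.DictReader(io.StringIO(table_text))
    return Counter(row[project_column] for row in reader)


# Hypothetical combined table; the run identifiers are placeholders.
example = (
    "Run,BioProject\n"
    "RUN0001,PRJNA477349\n"
    "RUN0002,PRJNA477349\n"
    "RUN0003,PRJEB21612\n"
)
for project, n_runs in runs_per_bioproject(example).items():
    print(project, n_runs)
```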



Filtering metadata

  • You may want the sample metadata to include only a few variables of interest.
  • It is good practice to replace long column names with short, meaningful ones.
  • We will select a few columns to create the metadata table used in downstream analyses.
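
Selecting and renaming columns can be sketched in a few lines. This is an illustration only: the `RENAME` mapping, the helper `select_and_rename`, and the example row are assumptions, and the columns actually kept in this project may differ.

```python
import csv
import io

# Hypothetical mapping from source column names to shorter names.
# Only the columns listed here are kept; all others are dropped.
RENAME = {
    "Run": "run",
    "BioProject": "bioproject",
    "Library Name": "sample_id",
}


def select_and_rename(table_text, rename=RENAME):
    """Keep only the columns in `rename` and give them their new names."""
    reader = csv.DictReader(io.StringIO(table_text))
    return [{new: row[old] for old, new in rename.items()} for row in reader]


# Hypothetical one-row table with one column that is not needed.
example = (
    "Run,BioProject,Library Name,Instrument\n"
    "RUN0001,PRJNA477349,bat_01,MiSeq\n"
)
print(select_and_rename(example))
```

Keeping the mapping in one dictionary documents, in a single place, which variables were retained and what each was renamed to.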



Column names of selected Projects for demo

  • BioProject number:
    • PRJNA477349: Multispecies 16S rRNA from bushmeat samples collected from Tanzania.

    • PRJNA685168: Multi-omics response to biologic therapies in IBD.


Sample collection points

  • Does the metadata contain the latitude and longitude (lat-lon) of the collection points?
  • If so, consider dropping a pin on the exact location.
  • The leaflet R package can place a pin at each coordinate.
  • Note that samples collected at the same coordinates will overlap.
  • You can zoom in and out to enlarge or shrink the map view.
  • You can also hover over a pin to see its variable label.
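
Because pins at identical coordinates overlap, it can help to group samples by coordinate before mapping them. The sketch below assumes the lat-lon field is written as "LAT N|S LON E|W" (for example "6.80 S 39.28 E"); that format, the helper names, and the coordinates are assumptions for illustration, not values from the demo projects.

```python
from collections import defaultdict


def parse_lat_lon(text):
    """Convert 'LAT N|S LON E|W' into signed decimal degrees."""
    lat, ns, lon, ew = text.split()
    latitude = float(lat) * (1 if ns.upper() == "N" else -1)
    longitude = float(lon) * (1 if ew.upper() == "E" else -1)
    return latitude, longitude


def group_by_coordinate(samples):
    """Map each (latitude, longitude) pair to the sample IDs recorded there."""
    groups = defaultdict(list)
    for sample_id, lat_lon in samples:
        groups[parse_lat_lon(lat_lon)].append(sample_id)
    return dict(groups)


# Hypothetical samples; S1 and S2 share a collection point.
samples = [
    ("S1", "6.80 S 39.28 E"),
    ("S2", "6.80 S 39.28 E"),
    ("S3", "3.37 S 36.68 E"),
]
for coordinate, sample_ids in group_by_coordinate(samples).items():
    print(coordinate, sample_ids)
```

Coordinates that map to more than one sample are the ones whose pins will sit on top of each other, so their labels could be merged into a single pin label.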









References

[1]
In-GitHub. (2023). Official repository for Citation Style Language (CSL). Accessed on February 06, 2023. Retrieved from https://github.com/citation-style-language/styles



Appendix

Static Snakemake report

The interactive Snakemake HTML report can be viewed by opening report.html in any compatible browser. You can explore the workflow and its associated statistics, and you can close the left sidebar to get a wider view of the display.



Troubleshooting

  1. CiteprocXMLError: Missing root element
    • The CSL file may be empty. Example Citation Style Language files are available on GitHub[1].
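
A quick way to test that diagnosis is to check whether the CSL file is empty or fails to parse as XML. The sketch below is one possible check, not part of the workflow; the helper name `check_csl` and the temporary file names are hypothetical. It relies on the fact that a CSL style has a root element named `style`.

```python
import os
import tempfile
import xml.etree.ElementTree as ET


def check_csl(path):
    """Return a short diagnosis of a CSL file."""
    if os.path.getsize(path) == 0:
        return "empty file"
    try:
        root = ET.parse(path).getroot()
    except ET.ParseError as err:
        return f"not well-formed XML: {err}"
    # Namespaced tags look like '{namespace}name'; keep only the local name.
    local_name = root.tag.rsplit("}", 1)[-1]
    if local_name != "style":
        return f"unexpected root element: {local_name}"
    return "ok"


# Demonstration with two throwaway files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    empty = os.path.join(tmp, "empty.csl")
    open(empty, "w").close()
    minimal = os.path.join(tmp, "minimal.csl")
    with open(minimal, "w", encoding="utf-8") as handle:
        handle.write('<style xmlns="http://purl.org/net/xbiblio/csl"/>')
    print(check_csl(empty))
    print(check_csl(minimal))
```

If the check reports an empty or malformed file, replacing it with a fresh copy of the style from the CSL repository[1] is a reasonable next step.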